Estimating Posterior Ratio for Classification:
Transfer Learning from Probabilistic Perspective

Abstract

Transfer learning assumes that classifiers of similar tasks share certain parameter structures.
Unfortunately, modern classifiers use sophisticated feature representations with
huge parameter spaces, which makes transfer costly. Under the impression that changes
from one classifier to another should be “simple”, an efficient transfer learning criterion
that only learns the “differences” is proposed in this paper. We train a posterior ratio,
which turns out to minimize an upper bound of the target learning risk. The model
of the posterior ratio does not have to share the same parameter space with the source
classifier at all, so it can be easily modelled and efficiently trained. The resulting classifier
is obtained by simply multiplying the existing probabilistic classifier with
the learned posterior ratio.

Keywords: Transfer Learning, Domain Adaptation. 


1 Introduction 

Transfer learning [12, 13, 6] trains a classifier using a limited number of samples with the help
of abundant samples drawn from another similar distribution. Specifically, we have a target
task providing a very small dataset $\mathcal{D}_P$ as well as a slightly different source task with a large
dataset $\mathcal{D}_Q$. Transfer learning [12, 13, 6] usually refers to procedures that make use of
the similarity between two learning tasks to build a superior classifier using both datasets.
In this paper, we focus on probabilistic classification problems where the goal is to learn
a class posterior $p(y|x)$ over $\mathcal{D}_P$, where $p(y|x)$ is the conditional probability of class labels
given an input $x$.

Due to the complexity of its parametrization, the predicting function is usually encoded in
hardware and executed with great efficiency. It is therefore reasonable to look at a composite
algorithm that consists of two parts: a fixed but fast built-in classifier offering complicated
predicting patterns, and a light-weight procedure that works as an adapter and transfers the
classifier to a variety of slightly different situations. For example, a general-purpose facial
recognition system built into a camera cannot change its predicting behavior once its model is trained;
however, the camera may learn transfer models and adjust itself to recognize a target user.
The challenge is that the transfer procedure is expected to respond rapidly, while learning over
the entire feature set of the source classifier may slow it down dramatically.

Intuitively, learning a transfer model does not necessarily need complicated features.
Since the task is still facial recognition, we can assume that the changes from one classifier
to another are simple and can be described by a trivial (say linear) model with a few key
personal features (say hair-style or glasses). General human facial modelling also plays
an important role; however, we may safely assume that such modelling has been taken care
of in the source classifier and remains unchanged in the target task. Thus, we can consider
only the “incremental model” in the transfer procedure.

One of the popular assumptions in transfer learning is to “reuse” the model from the
source classifier by training a target classifier while limiting the “distance” between it and the
source classifier model. Regularization has been utilized to enforce the closeness between
learned models [6]. More complicated structures, such as dependencies between task
parameters, are also used to construct a good classifier [13]. While most methods require learning
the classifiers of the two tasks simultaneously, some works can take already trained classifiers as
auxiliary models and learn to reuse their model structures [?].

However, reusing the existing model means we need to bring the entire feature set from
the source task and include it in the target classifier during transfer learning, even if we
know that a vast majority of the features do not contribute to the transition from the source to
the target classifier. Such an overly expressive model can be harmful given the limited samples
in $\mathcal{D}_P$. Moreover, the hyper-parameters used for constructing features may also be difficult
to tune, since cross-validation may be poor on such a small dataset $\mathcal{D}_P$. Finally, obtaining
those features may be time-consuming in some applications.

Another natural idea of transfer learning is to “borrow” informative samples from $\mathcal{D}_Q$
and get rid of harmful samples. TrAdaBoost [3] follows this exact learning strategy to assign
weights to samples from both $\mathcal{D}_P$ and $\mathcal{D}_Q$. By assigning high weights to samples that contribute
to the performance in the target task, and penalizing samples that “mislead” the classifier,
TrAdaBoost reuses the knowledge from both datasets to construct an accurate classifier on
the target task. The idea of importance sampling also gives rise to another set of methods
that learn sample weights by density ratio estimation [?]. Using unlabelled
samples from both datasets, an importance weighting function can be learned. By plugging
such a function into the empirical risk minimization criterion [16], we can use samples from
$\mathcal{D}_Q$ “as if” they were samples from $\mathcal{D}_P$. However, such a method does not allow “incremental
modelling” either, since it learns a full classifier model during the transfer.

It can be noticed that if one can directly model and learn the “difference” between the target
and source classifiers, one may use only the incremental features, which leads to a much more
efficient learning criterion.

The first contribution of this paper is showing that such “difference learning” is in fact the
learning of a posterior ratio, which is the ratio between the posteriors of the target and source
tasks. We show that learning such a posterior ratio is equivalent to minimizing an upper bound of
the classification error of the target task. Second, an efficient convex optimization algorithm
is given to learn the parameters of the posterior ratio model and is proved to give consistent
estimates under mild assumptions. Finally, the usefulness of this method is validated on
various artificial and real-world datasets.

However, we do not claim that the proposed method has superior performance against all
existing works based on extra assumptions, e.g. the smoothness of the predicting function
over unlabeled target samples [?, 2]. The proposed method is simply a novel probabilistic
framework working on a very small set of assumptions, and it offers flexibility of modelling
to transfer learning problems. It is fully extendable to various problem settings once new
assumptions are made.


2 Problem Setting 


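Consider two sets of samples drawn independently from two probability distributions $Q$ and
$P$ on $\{-1, 1\} \times \mathbb{R}^d$:

$$\mathcal{D}_Q = \left\{ (y_q^{(j)}, x_q^{(j)}) \right\}_{j=1}^{n'} \overset{\text{i.i.d.}}{\sim} Q, \qquad \mathcal{D}_P = \left\{ (y_p^{(i)}, x_p^{(i)}) \right\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} P.$$

$\mathcal{D}_Q$ and $\mathcal{D}_P$ are the source and target datasets respectively. We denote $p(y|x)$ and $q(y|x)$ as the
class posteriors in $P$ and $Q$ respectively. Moreover, $n \ll n'$.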

Our target is to obtain an estimate of the class posterior $p(y|x)$ and predict the class
label of an input $x$ by $\hat{y} = \operatorname{argmax}_y p(y|x)$.

Clearly, if $n$ is large enough, one may apply logistic regression [?] to obtain a good
estimate. In this paper, we focus on a scenario where $n$ is relatively small and $n'$ is sufficiently
large. Thus, it is desirable to transfer information from the source task to boost the
performance of our target classifier.


3 Composite Modeling 

Note that the posterior $p(y|x)$ can be decomposed into

$$p(y|x) = \frac{p(y|x)}{q(y|x)} \cdot q(y|x),$$

where $\frac{p(y|x)}{q(y|x)}$ is the class posterior ratio, and $q(y|x)$ is a source classifier.

This decomposition leads to a simple transfer learning methodology: model and learn the
posterior ratio and the general-purpose classifier separately, then multiply them together
as an estimate of the posterior $p(y|x)$.

The main interest of this paper is learning such a composite model using samples from $\mathcal{D}_P$
and $\mathcal{D}_Q$. Now, we introduce two parametric models $g(y, x; \theta)$ (or $g_\theta$ for short) and $q(y, x; \beta)$
(or $q_\beta$ for short) for $\frac{p(y|x)}{q(y|x)}$ and $q(y|x)$ respectively.

3.1 Kullback-Leibler Divergence Minimization

A natural way of learning such a model is to minimize the Kullback-Leibler (KL) divergence [10]
between the true posterior and our composite model.

Definition 1 (Conditional KL Divergence).

$$\mathrm{KL}[p \| q] = P \log \frac{p(y|x)}{q(y|x)},$$

where $Pf$ denotes the integral/sum of a function $f$ over a probability
distribution $P$ on its domain.

Now, we proceed to obtain the following upper bound of the KL divergence from $p$ to the
composite model:

Proposition 1 (Transfer Learning Upper-bound). If $\frac{p(y,x)}{q(y,x)} \le C_{\max} < \infty$ and $0 < q_\beta < 1$,
then the following inequality holds:

$$\mathrm{KL}[p \| g_\theta \cdot q_\beta] \le \mathrm{KL}[p \| g_\theta \cdot q] + C_{\max} \mathrm{KL}[q \| q_\beta] + C, \tag{1}$$

where $C$ is a constant that is irrelevant to $\theta$ or $\beta$.

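Proof.

$$\begin{aligned}
\mathrm{KL}[p \| g_\theta \cdot q_\beta] &= \mathrm{KL}\left[ p \,\Big\|\, g_\theta \cdot q \cdot \frac{q_\beta}{q} \right] \\
&= \mathrm{KL}[p \| g_\theta \cdot q] - P \log q_\beta + P \log q \\
&= \mathrm{KL}[p \| g_\theta \cdot q] - \int q(y,x) \frac{p(y,x)}{q(y,x)} \log q_\beta \, \mathrm{d}y \, \mathrm{d}x + P \log q \\
&\le \mathrm{KL}[p \| g_\theta \cdot q] + P \log q - C_{\max} Q \log q_\beta \\
&= \mathrm{KL}[p \| g_\theta \cdot q] + C_{\max} \mathrm{KL}[q \| q_\beta] + C,
\end{aligned} \tag{2}$$

where $C = P \log q - C_{\max} Q \log q$. Further,

$$\mathrm{KL}[p \| g_\theta \cdot q] + C_{\max} \mathrm{KL}[q \| q_\beta] \approx -\frac{1}{n} \sum_{i=1}^{n} \log g(y_p^{(i)}, x_p^{(i)}; \theta) - C_{\max} \frac{1}{n'} \sum_{j=1}^{n'} \log q(y_q^{(j)}, x_q^{(j)}; \beta) + C', \tag{3}$$

where $C'$ is a constant that is irrelevant to $\theta$ or $\beta$.

□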
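The inequality in Proposition 1 can be checked numerically on a small discrete example. The sketch below is only an illustration: the probability tables, the imperfect source model `q_beta` and the ratio values `g` are all made-up assumptions, and `x` is taken to be discrete so that expectations become finite sums.

```python
import math

# Toy joint distributions over (x, y) with x in {0, 1, 2} and y in {0, 1}.
# All numbers are invented for illustration; rows index x, columns index y.
p_joint = [[0.10, 0.20], [0.25, 0.15], [0.05, 0.25]]
q_joint = [[0.15, 0.15], [0.20, 0.20], [0.10, 0.20]]

def conditional(joint):
    """Turn a joint table over (x, y) into the class posterior of y given x."""
    return [[v / sum(row) for v in row] for row in joint]

def expect(joint, fn):
    """Expectation of fn(x, y) under a joint table."""
    return sum(joint[x][y] * fn(x, y) for x in range(len(joint)) for y in range(2))

p_post, q_post = conditional(p_joint), conditional(q_joint)
# A hypothetical imperfect source model q_beta(y | x) with values inside (0, 1).
q_beta = [[0.6, 0.4], [0.45, 0.55], [0.3, 0.7]]
# Any positive ratio values work here, since g appears on both sides of (1).
g = [[0.8, 1.1], [1.3, 0.7], [0.9, 1.05]]

# C_max bounds the joint density ratio p(y, x) / q(y, x).
c_max = max(p_joint[x][y] / q_joint[x][y] for x in range(3) for y in range(2))

def kl_from_p(model):
    """P log(p(y|x) / model(y|x)), the conditional KL from p to a model."""
    return expect(p_joint, lambda x, y: math.log(p_post[x][y] / model(x, y)))

lhs = kl_from_p(lambda x, y: g[x][y] * q_beta[x][y])
kl_q = expect(q_joint, lambda x, y: math.log(q_post[x][y] / q_beta[x][y]))
const = (expect(p_joint, lambda x, y: math.log(q_post[x][y]))
         - c_max * expect(q_joint, lambda x, y: math.log(q_post[x][y])))
rhs = kl_from_p(lambda x, y: g[x][y] * q_post[x][y]) + c_max * kl_q + const
```

Here `lhs` is the left-hand side of (1) and `rhs` the right-hand side with $C = P \log q - C_{\max} Q \log q$; the gap between them comes entirely from replacing $p/q$ by its bound $C_{\max}$.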
We may minimize the empirical upper bound (3) of the KL divergence in order to obtain
estimates of $\theta$ and $\beta$. $C_{\max}$ is an unknown constant introduced in (1) that indicates
how dissimilar the two tasks are. The upper bound in (1) formalizes the common intuition
that “if two tasks are similar, transfer learning should be easy,” since the more similar the two
tasks are, the smaller $C_{\max}$ is, and the tighter the bound is.

Note that minimizing (3) leads to two separate maximum likelihood estimations
(MLE). The MLE of the second likelihood term of bound (3),

$$\hat{\beta} = \operatorname*{argmax}_\beta \frac{1}{n'} \sum_{j=1}^{n'} \log q(y_q^{(j)}, x_q^{(j)}; \beta),$$

is a conventional MLE of a posterior model and has been well studied. $q$ can be
efficiently modeled and trained using techniques such as logistic regression [?]. Here we
consider it as already given. However, maximizing the first likelihood term, that of the posterior ratio,

$$\hat{\theta} = \operatorname*{argmax}_\theta \frac{1}{n} \sum_{i=1}^{n} \log g(y_p^{(i)}, x_p^{(i)}; \theta), \tag{4}$$

is our main focus. In the next section, we show that the modelling and learning of the posterior
ratio is feasible and computationally efficient.

3.2 Posterior Ratio Model 

Although it is not necessary, to illustrate the idea behind the posterior ratio modelling, we
assume that $p(y|x)$ and $q(y|x)$ belong to the exponential family, e.g. $p(y|x)$ can be parametrized
as

$$p(y|x; \beta) \propto \exp\left( \sum_{i=1}^{m} \beta_i \, y \, h_i(x) \right). \tag{5}$$

Given the parametrization (5), consider the ratio between $p$ and $q$:

$$\frac{p(y|x; \beta_p)}{q(y|x; \beta_q)} \propto \exp\left( y \sum_{i=1}^{m} (\beta_{p,i} - \beta_{q,i}) h_i(x) \right).$$

For all $i$ with $\beta_{p,i} - \beta_{q,i} = 0$, the feature $h_i$ is nullified and can therefore be ignored when
modelling the ratio. In fact, once the ratio is considered, the separate $\beta_p$ and $\beta_q$ do not
have to be learned; their difference $\theta_i = \beta_{p,i} - \beta_{q,i}$ alone is sufficient to describe the
transition between $q$ and $p$. Thus, we write our posterior ratio model as


$$g(y, x; \theta) = \frac{1}{N(x; \theta)} \exp\left( y \sum_{i \in S} \theta_i h_i(x) \right), \tag{6}$$

where $S = \{ i \mid \beta_{p,i} - \beta_{q,i} \neq 0 \}$ and $N(x; \theta)$ is the normalization term defined as

$$N(x; \theta) = \sum_{y \in \{-1, 1\}} q(y|x) \exp\left( y \sum_{i \in S} \theta_i h_i(x) \right).$$

Such normalization is needed because we are minimizing the KL divergence between
$p(y|x)$ and $g(y, x; \theta) q(y|x)$, so we need to make sure that $g(y, x; \theta) q(y|x)$ is a valid conditional
probability, i.e. $\sum_y q(y|x) g(y, x; \theta) = 1$.

This modelling technique gives us great flexibility, since it only concerns the “effective
features” $\{h_i\}_{i \in S}$ rather than the entire feature set $\{h_1, h_2, \ldots, h_m\}$. In this paper, we assume
the transfer should be simple, thus the potential feature set only contains “simple features”,
such as linear ones: $h_i(x) = x_i$, $i \in S$.

From now on, we simplify $y \sum_{i \in S} \theta_i h_i(x)$ using a linear representation $\theta^\top f(y, x)$, where

$$f(y, x) = [y h_{a_1}(x), y h_{a_2}(x), \ldots, y h_{a_{m'}}(x)],$$

and $a_1, a_2, \ldots, a_{m'} \in S$.
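To make the model (6) concrete, the following sketch evaluates it in a case where the source posterior is known in closed form. It assumes a one-dimensional input, a single linear feature $f(y, x) = yx$, and hypothetical parameter values $\beta_q = 2$ and $\beta_p = 1.5$; none of these choices come from the text above.

```python
import math

def source_posterior(y, x, beta_q=2.0):
    """q(y | x) proportional to exp(beta_q * y * x), for y in {-1, +1}."""
    return 1.0 / (1.0 + math.exp(-2.0 * beta_q * y * x))

def feature(y, x):
    """A single linear transfer feature f(y, x) = y * x."""
    return y * x

def normalizer(x, theta):
    """N(x; theta) = sum over y of q(y | x) * exp(theta * f(y, x))."""
    return sum(source_posterior(y, x) * math.exp(theta * feature(y, x))
               for y in (-1, 1))

def ratio(y, x, theta):
    """Posterior ratio model g(y, x; theta) from (6)."""
    return math.exp(theta * feature(y, x)) / normalizer(x, theta)

def target_posterior(y, x, theta):
    """Composite estimate of p(y | x): ratio model times source posterior."""
    return ratio(y, x, theta) * source_posterior(y, x)

theta = 1.5 - 2.0                                   # theta = beta_p - beta_q
x = 0.7
composite = target_posterior(1, x, theta)
direct = 1.0 / (1.0 + math.exp(-2.0 * 1.5 * x))     # p(+1 | x) with beta_p = 1.5
```

With $\theta = \beta_p - \beta_q$, the composite value agrees with the target posterior computed directly from $\beta_p$, and the two class probabilities sum to one because of the normalization term.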

However, this modelling also causes a problem: we cannot directly evaluate the output
value of this model, since we do not have access to the true posterior $q(y|x)$. Therefore, we
can only use samples from $\mathcal{D}_Q$ to approximate the normalization term.



4 Estimating Posterior Ratio 

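Now we introduce the estimator of the class-posterior ratio $p(y|x)/q(y|x)$. Let us substitute
the model (6) into the objective (4):

$$\begin{aligned}
\hat{\theta} &= \operatorname*{argmax}_\theta \frac{1}{n} \sum_{i=1}^{n} \log g(y_p^{(i)}, x_p^{(i)}; \theta) \\
&= \operatorname*{argmax}_\theta \frac{1}{n} \sum_{i=1}^{n} \theta^\top f(y_p^{(i)}, x_p^{(i)}) - \frac{1}{n} \sum_{i=1}^{n} \log N(x_p^{(i)}; \theta).
\end{aligned}$$

The normalization term needs to be evaluated in a pointwise fashion, i.e. $N(x_p^{(i)}; \theta)$ for all $x_p^{(i)} \in \mathcal{D}_P$.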




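Note that if we have sufficient observations $(y_q, x)$ paired with each $x_p^{(i)}$, i.e. $\{y_q^{(j)}\}_{j=1}^{k} \mid x \sim Q$
with $x = x_p^{(i)}$, such normalization can be approximated efficiently via a sample average:

$$N(x; \theta) \approx \frac{1}{k} \sum_{j=1}^{k} \exp\left( \theta^\top f(y_q^{(j)}, x) \right).$$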


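However, in practice not many observed samples may be paired with $x_p^{(i)}$. Especially when
$x$ lies in a continuous domain, we may not observe any paired sample at all. We may instead
use the neighbouring pairs $(y_q^{(j)}, x_q^{(j)})$, where $x_q^{(j)}$ is a neighbour of $x_p^{(i)}$, to approximate
$N(x_p^{(i)}; \theta)$, which naturally leads to the idea of $k$-nearest neighbours ($k$-NN) estimation of
this quantity (see Figure 1):

$$N_{n',k}(x_p^{(i)}; \theta) := \frac{1}{k} \sum_{j \in \mathcal{N}_{n'}(x_p^{(i)}, k)} \exp\left( \theta^\top f(y_q^{(j)}, x_q^{(j)}) \right),$$

where

$$\mathcal{N}_{n'}(x_p^{(i)}, k) = \left\{ j \;\middle|\; x_q^{(j)} \text{ is one of the } k\text{-NNs of } x_p^{(i)} \right\}.$$

Figure 1: Approximate $N(x; \theta)$ using nearest neighbours. (Axis label: $\mathbb{E}_q[\exp(\theta^\top f(y, x)) \mid x]$.)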
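A minimal sketch of the $k$-NN estimate of the normalization term is given below. It assumes Euclidean distance, a brute-force neighbour search, and the linear feature $f(y, x) = y[x, 1]$ used later in the experiments; the function names are illustrative only.

```python
import math

def feature(y, x):
    """f(y, x) = y * [x, 1] for an input vector x given as a list of floats."""
    return [y * v for v in x] + [float(y)]

def dot(a, b):
    """Inner product of two equally long lists."""
    return sum(u * v for u, v in zip(a, b))

def knn_indices(x, xs_q, k):
    """Indices of the k source inputs closest to x in Euclidean distance."""
    dist = [sum((u - v) ** 2 for u, v in zip(x, xq)) for xq in xs_q]
    return sorted(range(len(xs_q)), key=dist.__getitem__)[:k]

def knn_normalizer(x, theta, xs_q, ys_q, k):
    """k-NN estimate of N(x; theta) from labelled source samples."""
    idx = knn_indices(x, xs_q, k)
    return sum(math.exp(dot(theta, feature(ys_q[j], xs_q[j]))) for j in idx) / k

# Tiny made-up source sample with one-dimensional inputs.
xs_q = [[0.0], [0.1], [0.2], [5.0]]
ys_q = [1, -1, 1, -1]
n_at_zero_theta = knn_normalizer([0.05], [0.0, 0.0], xs_q, ys_q, 2)
n_bias_only = knn_normalizer([0.05], [0.0, 1.0], xs_q, ys_q, 2)
```

With $\theta = 0$ every exponential equals one, so the estimate is exactly one regardless of which neighbours are selected. A sorted scan costs $O(n' \log n')$ per query; a spatial index would be a natural replacement for large source sets.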


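Now we have a “computable” approximation to the posterior ratio model:

$$g_{n'}(y, x; \theta) = \frac{\exp\left( \theta^\top f(y, x) \right)}{N_{n',k}(x; \theta)}.$$

The resulting optimization is

$$\hat{\theta} = \operatorname*{argmin}_\theta \ell(\theta; \mathcal{D}_P, \mathcal{D}_Q), \qquad \ell(\theta; \mathcal{D}_P, \mathcal{D}_Q) = -\frac{1}{n} \sum_{i=1}^{n} \theta^\top f(y_p^{(i)}, x_p^{(i)}) + \frac{1}{n} \sum_{i=1}^{n} \log \left[ \frac{1}{k} \sum_{j \in \mathcal{N}_{n'}(x_p^{(i)}, k)} \exp\left( \theta^\top f(y_q^{(j)}, x_q^{(j)}) \right) \right], \tag{7}$$

which is convex. Note that $\ell$ represents the negative log-likelihood.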
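The objective (7) can be written down directly. The sketch below assumes scalar inputs, the feature $f(y, x) = y[x, 1]$, and datasets given as lists of `(y, x)` pairs; the sample values are invented for illustration.

```python
import math

def feature(y, x):
    """f(y, x) = y * [x, 1] for a scalar input x."""
    return (y * x, float(y))

def score(theta, y, x):
    """theta^T f(y, x)."""
    return sum(t * f for t, f in zip(theta, feature(y, x)))

def neighbours(x, data_q, k):
    """The k source pairs (y, x') whose inputs are closest to x."""
    return sorted(data_q, key=lambda s: abs(s[1] - x))[:k]

def neg_log_likelihood(theta, data_p, data_q, k):
    """Objective (7): negative log-likelihood with a k-NN normalizer."""
    total = 0.0
    for y, x in data_p:
        near = neighbours(x, data_q, k)
        n_hat = sum(math.exp(score(theta, yq, xq)) for yq, xq in near) / k
        total += -score(theta, y, x) + math.log(n_hat)
    return total / len(data_p)

# Made-up source and target samples.
data_q = [(1, 1.0), (1, 2.0), (-1, -1.0), (-1, -2.0), (1, 0.5), (-1, -0.5)]
data_p = [(1, 0.8), (-1, -0.7), (1, -0.1)]

a, b = (0.5, -0.2), (-1.0, 0.3)
mid = tuple((u + v) / 2.0 for u, v in zip(a, b))
loss_a = neg_log_likelihood(a, data_p, data_q, 3)
loss_b = neg_log_likelihood(b, data_p, data_q, 3)
loss_mid = neg_log_likelihood(mid, data_p, data_q, 3)
```

The midpoint value never exceeds the average of the two endpoint values, which reflects the convexity claimed above: the first term is linear in $\theta$ and the second is a log-sum-exp of linear functions.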

Moreover, if we assume that the changes between the two posteriors are “mild”, i.e. $\|\theta_i\| =
\|\beta_{p,i} - \beta_{q,i}\|$ is small, we may use an extra $\ell_2$ regularization to restrict the magnitude of our
model parameter $\theta$:

$$\operatorname*{argmin}_\theta \; \ell(\theta) + \lambda \|\theta\|^2, \tag{8}$$

where $\lambda$ is a regularization parameter that can be chosen via likelihood cross-validation in
practice. Finally, the gradient of $\ell(\theta)$ is given as




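$$\nabla_\theta \ell(\theta) = -\frac{1}{n} \sum_{i=1}^{n} f(y_p^{(i)}, x_p^{(i)}) + \frac{1}{n} \sum_{i=1}^{n} \hat{\mathbb{E}}_{n'}\left[ g_{n'}(y, x; \theta) f(y, x) \,\middle|\, x = x_p^{(i)} \right],$$

where $\hat{\mathbb{E}}_{n'}[Z \mid x = x_p^{(i)}]$ is the empirical $k$-NN estimate of a conditional expectation over $Q$:

$$\hat{\mathbb{E}}_{n'}\left[ Z \,\middle|\, x = x_p^{(i)} \right] = \frac{1}{k} \sum_{j \in \mathcal{N}_{n'}(x_p^{(i)}, k)} Z_j.$$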
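The gradient formula can be sanity-checked against central finite differences. The sketch below again assumes scalar inputs and the feature $f(y, x) = y[x, 1]$; the data, the evaluation point `theta0` and the step size `h` are arbitrary illustrative choices.

```python
import math

def feature(y, x):
    """f(y, x) = y * [x, 1] for a scalar input x."""
    return (y * x, float(y))

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def neighbour_features(x, data_q, k):
    """Features of the k source pairs whose inputs are closest to x."""
    near = sorted(data_q, key=lambda s: abs(s[1] - x))[:k]
    return [feature(yq, xq) for yq, xq in near]

def loss(theta, data_p, data_q, k):
    """Negative log-likelihood with the k-NN normalizer, as in (7)."""
    total = 0.0
    for y, x in data_p:
        feats = neighbour_features(x, data_q, k)
        n_hat = sum(math.exp(dot(theta, f)) for f in feats) / k
        total += math.log(n_hat) - dot(theta, feature(y, x))
    return total / len(data_p)

def grad(theta, data_p, data_q, k):
    """Analytic gradient: weighted neighbour features minus target features."""
    out = [0.0] * len(theta)
    for y, x in data_p:
        feats = neighbour_features(x, data_q, k)
        weights = [math.exp(dot(theta, f)) for f in feats]
        z = sum(weights)
        fp = feature(y, x)
        for a in range(len(theta)):
            expected = sum(w * f[a] for w, f in zip(weights, feats)) / z
            out[a] += (expected - fp[a]) / len(data_p)
    return out

data_q = [(1, 1.0), (1, 2.0), (-1, -1.0), (-1, -2.0), (1, 0.5), (-1, -0.5)]
data_p = [(1, 0.8), (-1, -0.7), (1, -0.1)]
theta0, h = [0.3, -0.4], 1e-6

analytic = grad(theta0, data_p, data_q, 3)
max_gap = 0.0
for a in range(len(theta0)):
    up, down = list(theta0), list(theta0)
    up[a] += h
    down[a] -= h
    numeric = (loss(up, data_p, data_q, 3) - loss(down, data_p, data_q, 3)) / (2 * h)
    max_gap = max(max_gap, abs(numeric - analytic[a]))
```

The weights $\exp(\theta^\top f_j) / \sum_{j'} \exp(\theta^\top f_{j'})$ inside `grad` are exactly $\frac{1}{k} g_{n'}$ evaluated at the neighbours, so the inner sum is the $k$-NN conditional expectation in the formula above.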


The computation of this gradient is straightforward, and thus we can use any gradient-based
method, such as quasi-Newton, to solve the unconstrained convex optimization in (8).

It can be noticed that this algorithm is similar to the density ratio estimation method
KLIEP [15]. Indeed, both are estimators that learn a ratio function between two
probabilities based on a maximum-likelihood criterion. However, the proposed method is different
from [12] in terms of modelling, motivation and usage.
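The whole estimation procedure can be sketched end to end as follows. This is an illustration under several assumptions rather than the implementation used in the paper: plain gradient descent with a backtracking line search stands in for a quasi-Newton solver, the data are one-dimensional balanced Gaussians, and the sample sizes, `k` and `lam` are arbitrary.

```python
import math
import random

def sample(n, mean, rng):
    """Balanced two-class 1-D data with x | y ~ Normal(y * mean, 1)."""
    return [(y, rng.gauss(y * mean, 1.0))
            for y in (1 if i % 2 == 0 else -1 for i in range(n))]

def feature(y, x):
    """f(y, x) = y * [x, 1] for a scalar input x."""
    return (y * x, float(y))

def precompute(data_p, data_q, k):
    """Pair each target feature with the features of its k nearest source samples."""
    pairs = []
    for y, x in data_p:
        near = sorted(data_q, key=lambda s: abs(s[1] - x))[:k]
        pairs.append((feature(y, x), [feature(yq, xq) for yq, xq in near]))
    return pairs

def loss_and_grad(theta, pairs, lam):
    """Regularized objective (8) and its gradient."""
    loss = lam * sum(t * t for t in theta)
    grad = [2.0 * lam * t for t in theta]
    for fp, feats in pairs:
        w = [math.exp(sum(t * f for t, f in zip(theta, fq))) for fq in feats]
        z = sum(w)
        fit_term = sum(t * f for t, f in zip(theta, fp))
        loss += (math.log(z / len(feats)) - fit_term) / len(pairs)
        for a in range(len(theta)):
            expected = sum(wi * fq[a] for wi, fq in zip(w, feats)) / z
            grad[a] += (expected - fp[a]) / len(pairs)
    return loss, grad

def fit(pairs, lam=1e-3, steps=200):
    """Gradient descent with Armijo backtracking, started at theta = 0."""
    theta = [0.0, 0.0]
    loss, grad = loss_and_grad(theta, pairs, lam)
    for _ in range(steps):
        step = 1.0
        while step > 1e-10:
            cand = [t - step * g for t, g in zip(theta, grad)]
            cand_loss, cand_grad = loss_and_grad(cand, pairs, lam)
            if cand_loss <= loss - 1e-4 * step * sum(g * g for g in grad):
                theta, loss, grad = cand, cand_loss, cand_grad
                break
            step *= 0.5
        else:
            break   # no acceptable step was found, so stop early
    return theta, loss

rng = random.Random(0)
data_q = sample(2000, 2.0, rng)    # abundant source samples
data_p = sample(40, 1.5, rng)      # scarce target samples
pairs = precompute(data_p, data_q, k=20)
theta_hat, final_loss = fit(pairs)
```

Because the neighbour sets do not depend on $\theta$, they are computed once before optimization. The objective is zero at $\theta = 0$, and the line search only accepts steps that decrease it, so the returned loss is never positive.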


5 Consistency of the Estimator 


In this section, we analyze the consistency of the estimator given in (7), i.e. whether the
estimated parameter converges to the solution of the population objective function. This
result is not straightforward, since we used an extra $k$-NN approximation in our model, so
that the model itself is an “estimate”. The question is: does this approximation lead to a
consistent estimator?

First, we define the estimated and true parameters as

$$\hat{\theta} = \operatorname*{argmin}_\theta \ell(\theta; \mathcal{D}_P, \mathcal{D}_Q) = \operatorname*{argmax}_\theta P_n \log g_{n'}(y, x; \theta),$$

$$\theta^* = \operatorname*{argmax}_\theta P \log g(y, x; \theta),$$

where $P_n$ is the empirical measure of distribution $P$.

Assumption 1 (Bounded Ratio Model). There exists $1 < M_{\max} < \infty$ such that $|\theta^\top f(y, x)| \le
\log(M_{\max})$. Moreover, $\theta$ lies in a totally bounded metric space and $\max_{y,x} \|f(y, x)\|_2 \le F_{\max}$,
where $0 < F_{\max} < \infty$.

Therefore $\exp(\theta^\top f(y, x))$, $N(x; \theta)$ and $N_{n',k}(x; \theta)$ all lie in $\left[ \frac{1}{M_{\max}}, M_{\max} \right]$, and the posterior
ratio model is always bounded by constants. This is a reasonable assumption: since the posterior
ratio measures the “differences” between two tasks, the true posterior ratio must be close to
one if the two tasks are similar.











Assumption 2 (Bounded Covariate Shift). $\frac{p(x)}{q(x)} \le R_{\max}$.

The supports of $P$ and $Q$ must overlap. If samples in $\mathcal{D}_P$ are distributed completely
differently from those in $\mathcal{D}_Q$, it does not make sense to expect that the transfer learning method
would work well.

Assumption 3 (Identifiability). $\theta^*$ is the unique global maximizer of the population objective
function $P \log g(y, x; \theta)$, i.e. for all $\epsilon > 0$,

$$\sup_{\theta: \|\theta - \theta^*\| > \epsilon} P \log g(y, x; \theta) < P \log g(y, x; \theta^*).$$

Then we have the following theorem, which states that our posterior ratio estimator is consistent.

Theorem 1. Suppose that for each $x$, the random variable $\|X - x\|$ is absolutely continuous.
If $n \to \infty$, $n' \to \infty$, $k_{n'} / \log n' \to \infty$ and $k_{n'} / n' \to 0$, where $k_{n'}$ is the sample-dependent
version of $k$, the number of nearest neighbours used in the $k$-NN approximation, then under the
above assumptions, $\hat{\theta} \overset{p}{\to} \theta^*$. Further, $\ell(\hat{\theta}; \mathcal{D}_P, \mathcal{D}_Q) \overset{p}{\to} \mathrm{KL}[p \| q]$.

The proof relies on the following lemma: 

Lemma 1. Under all assumptions stated above, if $n \to \infty$, $n' \to \infty$, $k_{n'} / \log n' \to \infty$ and
$k_{n'} / n' \to 0$, then $\sup_\theta |P_n \log g_{n'}(y, x; \theta) - P \log g(y, x; \theta)| \overset{p}{\to} 0$, i.e. the error caused by
approximating the objective using samples converges to 0 in probability, uniformly w.r.t. $\theta$.

One of the key steps is to decompose the above empirical approximation error of the
objective function into the approximation error caused by using samples from $P$ plus the modelling
error caused by $k$-NN using samples from $Q$. It can be observed that the bound of the density
ratio, $R_{\max}$, also contributes to the error. The complete proof is included in the appendix.


6 Decomposing Parameter vs. Decomposing Model

Instead of decomposing the model, $p = g_\theta \cdot h_\beta$, as we propose in this paper, the model-reuse
methods (e.g. [6, 13]) decompose the parameter, $\beta_p = \theta + \beta_q$, which leads to the problem of
minimizing a KL divergence:

$$\min_{\theta, \beta_q} \mathrm{KL}[p \| h(\theta + \beta_q)].$$

Two issues come with this criterion. First, this problem is not identifiable, since there exist
infinitely many possible combinations of $\theta$ and $\beta_q$ that minimize the objective function.
One must use extra assumptions. Model-reuse methods add a “regularizer” on the parameter
$\beta_q$ using the KL divergence:

$$\hat{\theta}, \hat{\beta}_q = \operatorname*{argmin}_{\theta, \beta_q} \mathrm{KL}[p \| h(\theta + \beta_q)] + \gamma \, \mathrm{KL}[q \| h(\beta_q)], \tag{9}$$

which implies that the minimizer $\hat{\beta}_q$ should also make the difference between $q$ and $h(\beta_q)$
small in terms of KL divergence. Here $\gamma$ is a “balancing parameter” that has to be tuned using
cross-validation, which may be poor when the number of samples from $\mathcal{D}_P$ is low. As we will
show later in the experiments, the choice of $\gamma$ is crucial to the performance when $n$ is small.

Second, since the model must be normalized, i.e. $\int h(\theta + \beta_q) \, \mathrm{d}y = 1$, $\beta_q$ and $\theta$ are
always coupled. One must always solve them together, meaning the algorithm has to handle
the complicated feature space for both $\beta_q$ and $\theta$.

Figure 2: Experiments on artificial datasets. (a) $\ell(\hat{\theta}; \mathcal{D}_P, \mathcal{D}_Q)$. (b) Negative hold-out likelihood. (c) Illustration of 4-Gaussian dataset shift. (d) $p(y|x; \hat{\beta}_p)$, miss-rate: 13.8%. (e) $q(y|x; \hat{\beta}_q)$, miss-rate: 15.2%. (f) $g(y, x; \hat{\theta}) q(y|x; \hat{\beta}_q)$, miss-rate: 8.0%.

However, things are much easier if we have access to the true parameter of the source posterior,
$\beta_q^*$. Then we can model the posterior of $p$ as $g(y, x; \theta) q(y|x; \beta_q^*)$, where $g$ is the model of the
ratio. This setting leads to the proposed posterior ratio learning method:

$$\hat{\theta} = \operatorname*{argmin}_\theta \mathrm{KL}[p \| g(y, x; \theta) q(y|x; \beta_q^*)],$$

where $\beta_q^*$ is a constant, so this optimization is with respect to $\theta$ only. This paper presents
an algorithm that can obtain an estimate of $g(y, x; \theta)$ even if one does not know $q(y|x; \beta_q^*)$
exactly. $q(y|x; \beta_q)$ is learned separately and is multiplied with $g(y, x; \theta)$ in order to provide
a posterior output. In comparison, the decomposition of the model results in two independent
optimizations, and we are free from the joint objective where the choice of the parameter $\gamma$ is
problematic. Neither do we have to assume that $\theta$ and $\beta_q$ are in the same parameter space.
Figure 3: 20 News datasets. Panels: (a) sci.crypt, (b) sci.electronics, (c) sci.med, (d) sci.space, (e) talk.politics.guns, (f) talk.politics.mideast, (g) talk.politics.misc, (h) talk.religion.misc. (Axis label: $n$.)


7 Experiments 


We fix the feature function $f$ as $f(y, x) := y [x, 1]^\top$. It is consistent with our “simple transfer
model” assumption discussed earlier.


7.1 Synthetic Experiments 

KL convergence The first experiment uses our trained posterior ratio model to
approximate the conditional KL divergence. Since our estimate $\hat{\theta} \to \theta^*$, we hope to see
$\ell(\hat{\theta}; \mathcal{D}_P, \mathcal{D}_Q) - \mathrm{KL}[p \| q] \to 0$ as $n, n' \to \infty$. We draw two balanced classes of samples from
two Gaussian distributions with different means for $P$ and $Q$. Specifically, for $y \in \{-1, 1\}$,
we construct $P$ and $Q$ as follows:

$$q(x|1) = \mathrm{Normal}(2, 1), \quad q(x|-1) = \mathrm{Normal}(-2, 1),$$
$$p(x|1) = \mathrm{Normal}(1.5, 1), \quad p(x|-1) = \mathrm{Normal}(-1.5, 1).$$

Figure 4: Amazon sentiment datasets. Panels: (a) kitchen, (b) dvd, (c) books.


We draw 5k samples from distribution $Q$ and $n$ samples from $P$, and $k$ is chosen to minimize the
error of conditional mean estimation (same below, as introduced in the appendix). We then
train a posterior ratio $g(y, x; \hat{\theta})$. By varying $n$ and sampling randomly, we create a plot of the
averaged $\ell(\hat{\theta}; \mathcal{D}_P, \mathcal{D}_Q)$, with standard error over 25 runs, in Figure 2(a). The true conditional
KL divergence is plotted alongside as a blue horizontal dashed line. For comparison, we
run the same estimation again with 50k samples from $Q$, and plot the result in red.
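For this synthetic setting, the reference value of the conditional KL divergence can be approximated by numerical integration. The sketch below assumes equal class priors (the text says the classes are balanced) and uses a simple midpoint rule on a truncated interval; the integration range and step count are arbitrary choices.

```python
import math

def normal_pdf(x, mean):
    """Density of Normal(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def log_posterior(y, x, mean):
    """Log class posterior for balanced classes with x | y ~ Normal(y * mean, 1)."""
    # The posterior is a sigmoid of t = 2 * mean * y * x, i.e. -log(1 + exp(-t)).
    t = 2.0 * mean * y * x
    return -math.log1p(math.exp(-t)) if t > -30.0 else t

def conditional_kl(mean_p=1.5, mean_q=2.0, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint-rule estimate of KL[p || q] = P log(p(y|x) / q(y|x))."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        for y in (-1, 1):
            joint = 0.5 * normal_pdf(x, y * mean_p)   # p(x, y) with equal priors
            gap = log_posterior(y, x, mean_p) - log_posterior(y, x, mean_q)
            total += joint * gap * h
    return total

kl_true = conditional_kl()
```

The integrand, summed over the two labels, is $p(x)$ times a KL divergence between two Bernoulli distributions, so the result is positive whenever the two posteriors differ.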

The result shows that our estimator does converge to the true KL divergence, and the
estimation error shrinks as $n \to \infty$. Increasing $n'$ also helps to slightly reduce the variance (compare
the blue error bars with the red error bars). However, this improvement is not as significant
as that obtained by increasing $n$.


Joint vs. Separated In this experiment, we demonstrate the effect of introducing the
“balancing parameter” $\gamma$ of the joint optimization method discussed in Section 6. We simply
reuse the dataset of the previous experiment and test the averaged negative hold-out
likelihood of the approach described in (9) and of the proposed method, using $\mathcal{D}_P$ of various sizes.
It can be seen that the choice of the parameter $\gamma$ has a huge effect on the hold-out likelihood
when $n$ is small. However, the proposed method is free from such a parameter and can achieve
a very low negative hold-out likelihood even when using only 10 samples from $\mathcal{D}_P$.


4-Gaussian This experiment demonstrates how a simple transfer model helps
transfer a non-linear classifier. The dataset $\mathcal{D}_Q$ is constructed using mixtures of Gaussian
distributions with different means on the horizontal axis, and the two classes of samples are not linearly
separable. To create the dataset $\mathcal{D}_P$, we simply shift their means away from each other along the
vertical dimension (see Figure 2(c)). We compare the posterior functions learned by kernel
logistic regression performed on $\mathcal{D}_P$ (Figure 2(d)) and $\mathcal{D}_Q$ (Figure 2(e)) with the proposed
transfer learning method (Figure 2(f)), which is a multiplication of the learned $g(y, x; \hat{\theta})$ and
$q(y|x; \hat{\beta}_q)$.

We set $n = 40$, $n' = 5000$. It can be seen from Figure 2(d) that although kernel logistic
regression has learned the rough decision boundary using $\mathcal{D}_P$ only, it has completely
missed the characteristics of the posterior function near the class border due to the lack of
observations. In contrast, built upon a successfully learned posterior function on dataset
$\mathcal{D}_Q$ (Figure 2(e)), the proposed method successfully transferred the posterior function to
the new dataset $\mathcal{D}_P$, even though it is equipped only with linear features (Figure 2(f)). The
classification boundary it provides is highly non-linear.


7.2 Real-world Applications 

20-news Experiments are run on the 20-news dataset, where articles are grouped into major
categories (such as “sports”) and sub-categories (such as “sports.basketball”). In this
experiment, we adopt the “one versus the others” scenario, i.e. the task is to predict whether an
article is drawn from a sub-category or not. We first construct $\mathcal{D}_P$ by randomly selecting a
few samples from a certain sub-category $T$ and then mixing them with an equal number of samples
from the rest of the categories. $\mathcal{D}_Q$ is constructed using abundant random samples from the
same major category but different sub-categories, with random samples from all the remaining categories as
negative samples. We adopt PCA and reduce the dimension to just 20.

Figure 3 summarizes the misclassification rate of the proposed transfer learning
algorithm and all the other methods: LogiP (logistic regression on $\mathcal{D}_P$), LogiQ (logistic regression
on $\mathcal{D}_Q$), TrAdaBoost [1], Reg [6], CovarShift [15, ?] and Adaptive [?], over different
sub-categories $T$ in the “sci” and “talk” categories. The results show that the proposed method
works well in almost all cases, while the comparison methods Reg, CovarShift and
TrAdaBoost sometimes have difficulties in beating the naive baselines LogiP and LogiQ. In
most cases, Adaptive cannot improve much on LogiP.


Amazon Sentiment The final experiment is conducted on the Amazon sentiment dataset,
where the task is to classify the positive or negative sentiment of users' review comments
on “kitchen, electronics, books and dvds”. Since some products (such as electronics)
are far better reviewed than others (such as kitchen tools), it is ideal to transfer a
classifier from a well-reviewed product to another one.

In this experiment, we first sample $\mathcal{D}_P$ from one product $T$ and construct the dataset $\mathcal{D}_Q$
using all samples from all other products. We apply locality preserving projection [?] to
reduce the original dimension from ~148000 to 30.

The classification error rate is reported in Figure 4 for $T$ = “kitchen”, “dvd” and “books”.
We omit $T$ = “electronics”, since LogiP and LogiQ have very close
performance on this dataset, suggesting that transfer learning is not helpful.

It can be seen that the proposed method also achieves a low misclassification rate
on all three datasets, even though Adaptive gradually catches up when $n$ is large enough.
Interestingly, Figures 4(b) and 4(c) show that LogiQ can achieve a very low error rate, and the
proposed method manages to reach similar rates. Even if the benefit of transferring is not
clear in these two cases, the proposed method does not seem to bring in extra errors by also
considering samples from the target dataset $\mathcal{D}_P$, which could have been misleading.

8 Conclusions 

As modern classifiers get increasingly complicated, the cost of transfer learning becomes a major
concern: in many applications, the transfer should be both quick and accurate. To reduce
the modelling complexity, we introduce a composite method: learn a posterior ratio and the
source probabilistic classifier separately, then combine them. As the posterior
ratio allows incremental modelling, features, no matter how complicated, can be ignored
as long as they do not participate in the dataset transfer. The posterior ratio is learned via
an efficient convex optimization and is proved consistent. Experiments on both artificial and
real-world datasets give promising results.
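Appendix, Proof for Lemma 1

Proof. First, we decompose the supremum of the approximation error of the empirical objective
function:

$$\begin{aligned}
&\sup_\theta \left| P_n \log g_{n'}(y, x; \theta) - P \log g(y, x; \theta) \right| \\
&\quad = \sup_\theta \left| \left( P_n \theta^\top f(y, x) - P_n \log N_{n'}(x; \theta) \right) - \left( P \theta^\top f(y, x) - P \log N(x; \theta) \right) \right| \\
&\quad \le \sup_\theta \left| (P_n - P) \theta^\top f(y, x) \right| + \sup_\theta \left| P_n \log N_{n'}(x; \theta) - P \log N(x; \theta) \right| \\
&\quad \le \sup_\theta \left| (P_n - P) \theta^\top f(y, x) \right| + \sup_\theta \left| (P_n - P) \log N_{n'}(x; \theta) + P \log N_{n'}(x; \theta) - P \log N(x; \theta) \right| \\
&\quad \le \sup_\theta \left| (P_n - P) \theta^\top f(y, x) \right| + \sup_\theta \left| (P_n - P) \log N_{n'}(x; \theta) \right| + \sup_\theta \left| P \log N_{n'}(x; \theta) - P \log N(x; \theta) \right|.
\end{aligned} \tag{10}$$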


The first two terms in (10) are due to the approximation using samples from $\mathcal{D}_P$, while the
third term is the model approximation error caused by using $k$-NN to approximate $N(x; \theta)$.
The first two terms are relatively easy to bound. The Uniform Law of Large Numbers
(see, e.g. Lemma 2.4 in [?]) can be applied to show that the first two terms converge to 0 in
probability, since i. the parameter space of $\theta$ is compact, ii. both $\theta^\top f(y, x)$ and $\log N_{n'}$ are continuous in $\theta$, iii.
both of the above functions are Lipschitz continuous, as we will show later. As to the third term,
we first prove that for all $\epsilon > 0$

$$\mathrm{Prob}\left( \sup_\theta \left[ P \log N_{n'}(x; \theta) - P \log N(x; \theta) \right] > \epsilon \right) \to 0$$

by using the inequality $\log a - \log b \le \frac{a - b}{b}$:

$$\begin{aligned}
\sup_\theta \left[ P \log N_{n'}(x; \theta) - P \log N(x; \theta) \right] &\le \sup_\theta P \, \frac{N_{n'}(x; \theta) - N(x; \theta)}{N(x; \theta)} \\
&\le \sup_\theta P \left| \frac{N_{n'}(x; \theta) - N(x; \theta)}{N(x; \theta)} \right| \\
&\le M_{\max} \sup_\theta P \left| N_{n'}(x; \theta) - N(x; \theta) \right| \\
&\le R_{\max} M_{\max} \sup_\theta Q \left| N_{n'}(x; \theta) - N(x; \theta) \right|.
\end{aligned} \tag{11}$$


To show that the final line converges to 0 in probability, we use the Generic Uniform Law
of Large Numbers (Generic ULLN) (see [1], Theorem 1):

Theorem 2 (Generic ULLN). For a random sequence $\{G_n(\theta), \theta \in \Theta, n \ge 1\}$, if $\Theta$ is a
totally bounded metric space, $G_n(\theta)$ is stochastically equicontinuous (SE) and $|G_n(\theta)| \overset{p}{\to} 0$ for all $\theta \in \Theta$,
then $\sup_\theta |G_n(\theta)| \overset{p}{\to} 0$ as $n \to \infty$.

By assumption, $\theta$ lies in a totally bounded space. We now verify the remaining two conditions of this theorem.
The universal consistency of $k$-NN has been proved (see [7], Theorems 23.8 and 23.7). Here we
restate the result for convenience:

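Theorem 3 (Universal consistency of $k$-NN). Given that $Z$ is bounded, assume that for each $x$
the random variable $\|X - x\|$ is absolutely continuous. If $k_{n'} / \log n' \to \infty$ and $k_{n'} / n' \to 0$,
the $k_{n'}$-NN estimator is strongly universally consistent, i.e.

$$\lim_{n' \to \infty} \int \left( \frac{1}{k_{n'}} \sum_{j \in \mathcal{N}_{n'}(x, k_{n'})} Z_j - \mathbb{E}[Z \mid X = x] \right)^2 \mathrm{d}\mu(x) = 0$$

with probability one for all distributions of $(Z, X)$, where $\mu(x)$ is the probability measure of $x$.
From Jensen's inequality, we have

$$\left( \int \left| \frac{1}{k_{n'}} \sum_{j \in \mathcal{N}_{n'}(x, k_{n'})} Z_j - \mathbb{E}[Z \mid X = x] \right| \mathrm{d}\mu(x) \right)^2 \le \int \left( \frac{1}{k_{n'}} \sum_{j \in \mathcal{N}_{n'}(x, k_{n'})} Z_j - \mathbb{E}[Z \mid X = x] \right)^2 \mathrm{d}\mu(x),$$

and it can be seen that the left-hand side also converges to 0 in probability. By using the
Continuous Mapping Theorem, we can finally show that
$\int \left| \frac{1}{k_{n'}} \sum_{j \in \mathcal{N}_{n'}(x, k_{n'})} Z_j - \mathbb{E}[Z \mid X = x] \right| \mathrm{d}\mu(x)$
converges to 0 in probability.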
We let $Z_\theta = \exp(\theta^\top f(Y, X))$ be a new random variable, so that we have samples
$(Z_\theta, X)$ drawn from distribution $Q$, and

$$Q \left| N_{n'}(x; \theta) - N(x; \theta) \right| = \int \left| \frac{1}{k} \sum_{j \in \mathcal{N}_{n'}(x, k)} Z_{\theta,j} - \mathbb{E}[Z_\theta \mid X = x] \right| \mathrm{d}\mu(x).$$


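By applying Theorem 3, we can conclude that $Q |N_{n'}(x; \theta) - N(x; \theta)|$ converges to 0 in
probability for all distributions of $(Z_\theta, X)$ indexed by the parameter $\theta$. Next, we verify the SE of
$Q |N_{n'}(x; \theta) - N(x; \theta)|$. Given Assumption 1, we have

$$\begin{aligned}
&\Big| Q \left| N_{n'}(x; \theta) - N(x; \theta) \right| - Q \left| N_{n'}(x; \theta') - N(x; \theta') \right| \Big| \\
&\quad \le Q \left| N_{n'}(x; \theta) - N_{n'}(x; \theta') + N(x; \theta') - N(x; \theta) \right| \\
&\quad \le 2 M_{\max} F_{\max} \|\theta - \theta'\|.
\end{aligned} \tag{12}$$

The last line is due to the mean-value theorem:

$$\begin{aligned}
\left| \exp(\theta^\top f(y, x)) - \exp(\theta'^\top f(y, x)) \right| &\le \|\theta - \theta'\| \, \left\| f(y, x) \exp(\bar{\theta}^\top f(y, x)) \right\| \\
&\le \|\theta - \theta'\| F_{\max} M_{\max},
\end{aligned}$$

where $\bar{\theta}$ is a vector in between $\theta$ and $\theta'$ elementwise.

In fact, (12) shows that the function $Q |N_{n'}(x; \theta) - N(x; \theta)|$ is Lipschitz continuous with
respect to $\theta$, and according to Lemma 2 in [1], this implies SE. Similarly, one can show that
$N_{n'}(x; \theta)$ is Lipschitz continuous.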

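Now we can use i. the boundedness of $\theta$, ii. SE and iii. universal
consistency to conclude that

$$\sup_\theta Q \left| N_{n'}(x; \theta) - N(x; \theta) \right| \overset{p}{\to} 0,$$

and due to (11):

$$\mathrm{Prob}\left( \sup_\theta \left[ P \log N_{n'}(x; \theta) - P \log N(x; \theta) \right] > \epsilon \right) \to 0.$$

Similarly, one can prove that

$$\mathrm{Prob}\left( \sup_\theta \left[ P \log N(x; \theta) - P \log N_{n'}(x; \theta) \right] > \epsilon \right) \to 0.$$

As a consequence, the third term in (10) converges to 0 in probability.

□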
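After obtaining Lemma 1, the rest is similar to the proof of Theorem 9.13 in [?]. Let
$M(\theta) := P \log g(y, x; \theta)$ and $M_{n,n'}(\theta) := P_n \log g_{n'}(y, x; \theta)$. Then

$$\begin{aligned}
M(\theta^*) - M(\hat{\theta}) &= M(\theta^*) - M_{n,n'}(\hat{\theta}) + M_{n,n'}(\hat{\theta}) - M(\hat{\theta}) \\
&\le M(\theta^*) - M_{n,n'}(\theta^*) + M_{n,n'}(\hat{\theta}) - M(\hat{\theta}) \\
&\le M(\theta^*) - M_{n,n'}(\theta^*) + \sup_\theta \left| M_{n,n'}(\theta) - M(\theta) \right|.
\end{aligned}$$

That the last line converges to 0 in probability is proved in Lemma 1. Therefore, we can write:

$$\forall \epsilon > 0, \quad P\left( M(\theta^*) - M(\hat{\theta}) > \epsilon \right) \to 0.$$

Due to Assumption 3, for an arbitrary choice of $\epsilon_0 > 0$, if $\|\theta^* - \hat{\theta}\| > \epsilon_0$, there must be an
$\epsilon > 0$ such that $M(\theta^*) - M(\hat{\theta}) > \epsilon$. Therefore, we conclude

$$\forall \epsilon_0 > 0, \quad P\left( \|\theta^* - \hat{\theta}\| > \epsilon_0 \right) \le P\left( M(\theta^*) - M(\hat{\theta}) > \epsilon \right) \to 0.$$

Also, $M_{n,n'}(\hat{\theta}) - M(\theta^*) = M_{n,n'}(\hat{\theta}) - M(\hat{\theta}) + M(\hat{\theta}) - M(\theta^*)$. Due to Lemma 1, it converges to
0 in probability. Therefore, we have $\ell(\hat{\theta}; \mathcal{D}_P, \mathcal{D}_Q) \overset{p}{\to} \mathrm{KL}[p \| q]$.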


Tuning Parameters in Posterior Ratio Estimation 


$k$ in $k$-NN: As mentioned in Section 7.1, $k$ is tuned via 5-fold cross-validation, based on the
testing criterion

$$\mathrm{MSE} = \frac{1}{|\mathcal{D}_{ho}|} \sum_{i \in \mathcal{D}_{ho}} \left( Z_q^{(i)} - \frac{1}{k} \sum_{j \in \mathcal{N}(x_q^{(i)}, k)} Z_q^{(j)} \right)^2, \tag{13}$$

where $\mathcal{D}_{ho}$ is a hold-out dataset and $Z_q^{(i)} = \exp\left( \theta^\top f(y_q^{(i)}, x_q^{(i)}) \right)$. However, this value
depends on $\theta$, which changes at every iteration of gradient descent. Instead of tuning
$k$ after each iteration, we follow a simple heuristic: 1) fix $k$ and run gradient descent; 2)
fix $\hat{\theta}$ and choose a suitable $k$ that minimizes (13). Steps 1) and 2) are repeated until convergence.
This heuristic performs very well in experiments.
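Step 2) of the heuristic can be sketched as follows. This is a simplified illustration under stated assumptions: a single hold-out split replaces the 5-fold cross-validation, inputs are scalar, the feature is $f(y, x) = y[x, 1]$, and the criterion follows the reading of (13) in which each hold-out value is compared with the average over its $k$ nearest training samples.

```python
import math

def z_value(theta, y, x):
    """Z = exp(theta^T f(y, x)) with f(y, x) = y * [x, 1] and scalar x."""
    return math.exp(theta[0] * y * x + theta[1] * y)

def holdout_mse(theta, train_q, holdout_q, k):
    """Squared error between each hold-out Z and its k-NN average on train_q."""
    err = 0.0
    for y, x in holdout_q:
        near = sorted(train_q, key=lambda s: abs(s[1] - x))[:k]
        knn_mean = sum(z_value(theta, yq, xq) for yq, xq in near) / k
        err += (z_value(theta, y, x) - knn_mean) ** 2
    return err / len(holdout_q)

def choose_k(theta, train_q, holdout_q, candidates):
    """Pick the candidate k with the smallest hold-out error for a fixed theta."""
    return min(candidates, key=lambda k: holdout_mse(theta, train_q, holdout_q, k))

# Made-up source samples split into a training part and a hold-out part.
train_q = [(1, 1.0), (-1, -1.0), (1, 2.0), (-1, -2.0), (1, 0.5)]
holdout_q = [(1, 1.5), (-1, -1.5)]
best_k = choose_k((0.5, 0.0), train_q, holdout_q, [1, 2, 3])
```

Alternating between this selection and a gradient-descent run with the selected `k` mirrors steps 1) and 2) above. At $\theta = 0$ every $Z$ equals one, so the criterion is zero for any `k` and carries no information until $\theta$ has moved away from the origin.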